Reading: <a href="http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf">Generative and Discriminative Classifiers</a> by Tom Mitchell.
Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
How can we easily model the assumption that just 2 of the $n$ attributes are dependent?
What does the decision surface of a Naïve Bayes classifier look like?
How would you select a subset of Xi’s?
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm
import seaborn as sns
sns.set_theme()
x1 = np.linspace(-10,10,1000)
x2 = np.linspace(-10,10,1000)
Logistic regression (a discriminative classifier):
Learns $f: X \rightarrow Y$, where
$P(Y|X)$ is modeled as:
\begin{eqnarray} P(Y=1|X) = \frac{1}{1+\exp(- (w_0+\sum_i w_i X_i))} = \frac{\exp(w_0+\sum_i w_i X_i)}{\exp(w_0+\sum_i w_i X_i)+1} \end{eqnarray}
It uses the logistic (or sigmoid) function:
\begin{eqnarray} \frac{1}{1+\exp(-z)} \end{eqnarray}
z = np.linspace(-10,10,1000)
plt.plot(z,1/(1+np.exp(-z)))
plt.xlabel(r'$Z = w_0+\sum_i w_i X_i$',fontsize=16)
plt.ylabel(r'logistic$(Z) = P(Y=1|X)$',fontsize=16)
Asking $P(Y=1|X)>P(Y=0|X)$ is the same as asking if $\ln \frac{P(Y=1|X)}{P(Y=0|X)} >0$.
i.e. is $$w_0+\sum_i w_i X_i ~ ~ >0?$$
This is a linear decision boundary!
from scipy.stats import multivariate_normal
# similar to previous example
mu_1_1 = -4; sigma_1_1 = 2;mu_2_1 = 4; sigma_2_1 = 2
mu_1_0 = 4; sigma_1_0 = 2;mu_2_0 = -4; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1], cov=cov_positive, size = (20))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0], cov=cov_negative, size = (20))
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
# hand picked line
plt.plot(x1, x1*0.8+0.5)
from labellines import labelLine
labelLine(plt.gca().get_lines()[-1],0.6,label=r'$w_0+\sum_i w_i X_i = 0$',fontsize=16)
plt.axis([-10,10,-10,10]); plt.gca().set_aspect('equal')
plt.xlabel(r'$x_1$',fontsize=20); plt.ylabel(r'$x_2$',fontsize=20)
plt.title('Data space',fontsize=20);
The weights $w_i$ are optimized such that when $w_0+\sum_i w_i X_i > 0$ the example is more likely to be positive and when $w_0+\sum_i w_i X_i < 0$ it's more likely to be negative.
$w_0+\sum_i w_i X_i = 0, P(Y=1|X) = \frac{1}{2}$
$w_0+\sum_i w_i X_i \rightarrow \infty, P(Y=1|X) \rightarrow 1$
$w_0+\sum_i w_i X_i \rightarrow -\infty, P(Y=1|X) \rightarrow 0$
Let's focus on binary classification
$ P(Y=1|X) = \frac{\exp(w_0+\sum_i w_i X_i)}{\exp(w_0+\sum_i w_i X_i)+1} $
$ P(Y=0|X) = \frac{1}{\exp(w_0+\sum_i w_i X_i)+1} $
How to learn $w_0$, $w_1$...$w_d$?
Training data: $\{(X^{(j)},Y^{(j)})\}_{j=1}^n$, with $~X^{(j)}=\left(X_1^{(j)},X_2^{(j)},...X_d^{(j)} \right)$
Maximum Likelihood Estimation:
$$\hat {\bf w}_{\text{MLE}} = \underset{{\bf w}}{\operatorname{argmax}} \prod_{j=1}^n P(X^{(j)},Y^{(j)}| {\bf w})$$
Problem: We don’t have a model for $P(X)$ or $P(X|Y)$ – only for $P(Y|X)$
Discriminative philosophy – Don’t waste effort learning $P(X)$, focus on $P(Y|X)$
Maximum (Conditional) Likelihood Estimation:
$$\hat {\bf w}_{\text{MCLE}} = \underset{{\bf w}}{\operatorname{argmax}} \prod_{j=1}^n P(Y^{(j)}|X^{(j)},{\bf w})$$
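The MCLE objective can be computed directly; a minimal sketch for logistic regression (the dataset here is a tiny hypothetical one, with a leading column of ones standing in for $w_0$):

```python
import numpy as np

def conditional_log_likelihood(w, X, y):
    # log prod_j P(y^j | x^j, w) for logistic regression
    z = X @ w                              # z_j = w_0 + sum_i w_i x_i^j
    # log P(y=1|x) = z - log(1+e^z);  log P(y=0|x) = -log(1+e^z)
    return np.sum(y * z - np.log1p(np.exp(z)))

# hypothetical toy data: 4 points, one feature plus an intercept column
X = np.array([[1., -2.], [1., -1.], [1., 1.], [1., 2.]])
y = np.array([0., 0., 1., 1.])
w = np.array([0., 1.])

ll = conditional_log_likelihood(w, X, y)   # log-likelihood is always <= 0
```

Maximizing this quantity over ${\bf w}$ is exactly the MCLE problem above.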
Gradient: $$\nabla_{{\bf w}} l({\bf w}) = \left[ \frac{\partial l({\bf w}) }{\partial w_0},...,\frac{\partial l({\bf w})}{\partial w_d} \right] $$
Update rule for gradient ascent, with learning rate $\eta>0$ $$\Delta{\bf w} = \eta\nabla_{{\bf w}} l({\bf w}) $$
$$ w_i^{(t+1)} = w_i^{(t)}+\eta \frac{\partial l({\bf w})}{\partial w_i}\mid_{w_t} $$
Update rule for gradient descent, with learning rate $\eta>0$ $$\Delta{\bf w} = - \eta\nabla_{{\bf w}} l({\bf w}) $$
$$ w_i^{(t+1)} = w_i^{(t)} - \eta \frac{\partial l({\bf w})}{\partial w_i}\mid_{w_t} $$
(maximizing $l({\bf w})$ is the same as minimizing $l'({\bf w}) = -l({\bf w})$)
Review: let's start with a simple function:
$$f(w) = 0.2(w - 2)^2 + 1$$
We know that this function is convex (its 2nd derivative exists and is positive).
f = lambda w: 0.2*(w-2)**2+1
dfdw = lambda w: 0.4*w - 0.8
w = np.linspace(-4,8,1000)
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
plt.title(r'Minimize $f(w)$, start with a random point $w_0$',fontsize = 20);
w_0 = 6
plt.plot(w_0, f(w_0), "o",markersize=10)
def draw_vector_2D(ax, x, y, lenx, leny, name, color='k'):
    ax.quiver(x, y, lenx, leny, color=color, angles='xy', scale_units='xy', scale=1)
    ax.text(x+lenx/2, y+leny/2+0.5, name, fontsize=16, color=color)
draw_vector_2D(plt, w_0, 0, dfdw(w_0),0, r'$\nabla f(w_0)$','k')
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
plt.title(r'Minimize $f(w)$, start with a random point $w_0$, step size $\eta=0.5$',fontsize = 20);
w_0 = 6
plt.plot(w_0, f(w_0), "o",markersize=10)
draw_vector_2D(plt, w_0, 0, dfdw(w_0),0, r' ','k')
eta=0.5
draw_vector_2D(plt, w_0, 0, - dfdw(w_0)*eta,0, r'$-\eta\nabla f(w_0)$','r')
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
w_1 = w_0 - dfdw(w_0)*eta
plt.title(r'Set $w_{t+1} = w_{t} - \eta \nabla f(w_t)$ and repeat', fontsize = 20);
plt.plot(w_0, f(w_0), "o",markersize=10)
plt.plot(w_1, f(w_1), "o",markersize=10)
draw_vector_2D(plt, w_1, 0, - dfdw(w_1)*eta,0, r'$-\eta\nabla f(w_1)$','g')
plt.plot(w, f(w), linewidth=3 )
plt.xlabel(r'$w$')
plt.ylabel(r'$f(w)$')
# w_1 = w_0 - dfdw(w_0)*eta
w_t = np.zeros(10)
w_t[0] = 7 # w_0
eta = 4
for i in range(1,10):
    w_t[i] = w_t[i-1] - eta * dfdw(w_t[i-1])
plt.title(r'Gradient descent with $\eta={}$'.format(eta), fontsize = 20);
plt.plot(w_t, f(w_t), "o-",markersize=10)
x = np.linspace(-1,2,100);y = np.linspace(-1,2,100); X,Y = np.meshgrid(x, y)
f_XY = np.cos(X)+np.sin(Y)
plt.figure(figsize=(6,5))
cs = plt.contourf(X, Y, f_XY,20,cmap='RdBu_r',vmin=-1,vmax=1,alpha=0.6);plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel('x');plt.ylabel('y')
plt.title('color indicates magnitude of function \n gradient is perpendicular to level set')
draw_vector_2D(plt, 1.45,0.5,-np.sin(1.45),np.cos(0.5),'','k')
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
%config InlineBackend.print_figure_kwargs = {'bbox_inches':None}
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')
Z = np.cos(X)+np.sin(Y)
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap='gray',
linewidth=0, antialiased=False, rcount=200, ccount=200)
ax.set_xlabel('x');ax.set_ylabel('y');ax.set_zlabel('f(x,y)');
ax.set_title('Alternative visualization, 3rd dim is magnitude', fontsize=16);
Simple simulated example
# Previous example
mu_1_1 = -5; sigma_1_1 = 2;mu_2_1 = 5; sigma_2_1 = 2
mu_1_0 = 5; sigma_1_0 = 2; mu_2_0 = -5; sigma_2_0 = 2
cov_positive = np.array([[sigma_1_1**2,3], [3,sigma_2_1**2]] )
cov_negative = np.array([[sigma_1_0**2,3], [3,sigma_2_0**2]] )
# Sample data from these distributions
X_positive = multivariate_normal.rvs(mean=[mu_1_1,mu_2_1], cov=cov_positive, size = (20))
X_negative = multivariate_normal.rvs(mean=[mu_1_0,mu_2_0], cov=cov_negative, size = (20))
X = np.vstack([X_positive, X_negative])
Y = np.vstack([np.ones((X_positive.shape[0],1)),np.zeros((X_negative.shape[0],1))])
plt.figure(figsize=(6,6))
plt.scatter(X_positive[:, 0], X_positive[:, 1],facecolors='r', edgecolors='w')
plt.scatter(X_negative[:, 0], X_negative[:, 1],facecolors='b', edgecolors='w')
plt.plot(x1, x1*0.8)
plt.axis([-10,10,-10,10]); plt.gca().set_aspect('equal')
plt.xlabel(r'$x_1$',fontsize=20)
plt.ylabel(r'$x_2$',fontsize=20)
plt.title('Data space',fontsize=20);
w1x = np.linspace(-60,40,100)
w2x = np.linspace(-60,40,100)
W1,W2 = np.meshgrid(w1x, w2x)
## omitting w_0 just for illustration
def loglikelihood(w1,w2):
    w = np.array([[w1],[w2]]) # make w a column vector
    # np.logaddexp(0, z) computes log(1 + exp(z)) without overflow
    return np.sum(Y*X.dot(w) - np.logaddexp(0, X.dot(w)))
L_w = np.vectorize(loglikelihood)(*np.meshgrid(w1x, w2x))
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='Greens_r',vmin=np.min(L_w),vmax=0,alpha=0.6);
plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20)
plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
for $i = 1...d$:
\begin{eqnarray} \frac{\partial l({\bf w})}{\partial w_i} &=& \sum_j x_i^j \left [ y^j - \hat P(Y^j=1|{\bf x}^j, {\bf w^{(t)}}) \right] \end{eqnarray}
from scipy.special import expit
def gradient_likelihood(w1,w2,X,Y):
    w = np.array([[w1],[w2]])
    P_Y_1 = expit(X.dot(w))  # numerically stable sigmoid
    gw1 = X[:,0:1].T.dot(Y-P_Y_1)
    gw2 = X[:,1:2].T.dot(Y-P_Y_1)
    return gw1, gw2
plt.figure(figsize=(9,7))
cs = plt.contourf(W1, W2, L_w,20,cmap='Greens_r',vmin=np.min(L_w),
vmax=0,alpha=0.6); plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20);plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
w1 = -5; w2 = -5
gw1, gw2 = gradient_likelihood(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'$\nabla L(w)/10$','k');
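Putting the update rule to work end-to-end; a self-contained gradient-ascent sketch on freshly simulated data (fixed step size and no intercept term, both simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs: positives around (-4,4), negatives around (4,-4)
X = np.vstack([rng.normal([-4, 4], 1.5, size=(20, 2)),
               rng.normal([4, -4], 1.5, size=(20, 2))])
Y = np.concatenate([np.ones(20), np.zeros(20)])

w = np.zeros(2)
eta = 0.05
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # current P(Y=1 | x, w)
    w = w + eta * X.T @ (Y - p)        # ascend the conditional log-likelihood

accuracy = np.mean((X @ w > 0).astype(float) == Y)
```

On separable data like this, the learned boundary should classify nearly every point correctly.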
Iterate until convergence (until change $< \epsilon$)
\begin{eqnarray} w_0^{(t+1)} = w_0^{(t)} + \eta \sum_j \left [ y^j - \hat P(Y^j=1|{\bf x}^j, {\bf w^{(t)}}) \right] \end{eqnarray}
for $i = 1...d$:
\begin{eqnarray} w_i^{(t+1)} = w_i^{(t)} + \eta \sum_j x_i^j \left [ y^j - \hat P(Y^j=1|{\bf x}^j, {\bf w^{(t)}}) \right] \end{eqnarray}
What is the $[ y^j - \hat P(Y^j=1|{\bf x}^j, {\bf w^{(t)}})]$ term doing?
Compare to:
Typically, the loss / function to minimize is a sum over the loss for each individual point: $ L({\bf w}) = \frac{1}{n}\sum_{i=1}^n L_i({\bf w})$.
We use $\nabla L_i({\bf w})$ instead of $\nabla L({\bf w})$. Since we sample the points uniformly, they all have equal probability, and the expected value of $\nabla L_i({\bf w})$ is:
$E_I[\nabla L_i({\bf w})] = \sum_{i=1}^n P(i)\nabla L_i({\bf w}) = \frac{1}{n}\sum_{i=1}^n \nabla L_i({\bf w}) = \nabla L({\bf w})$.
Algorithm (SGD): repeat until convergence: sample $i$ uniformly from $\{1,...,n\}$, then update ${\bf w} \leftarrow {\bf w} - \eta \nabla L_i({\bf w})$.
To reduce the noise in the gradients at individual datapoints, one approach is to average the gradient over a small mini-batch of points.
Algorithm (mini-batch SGD): repeat until convergence: sample a mini-batch $B$ of $m$ points uniformly, then update ${\bf w} \leftarrow {\bf w} - \frac{\eta}{m} \sum_{i \in B} \nabla L_i({\bf w})$.
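The two variants can be sketched generically; `grad_Li` below is a hypothetical per-point gradient function supplied by the user:

```python
import numpy as np

def sgd(w, grad_Li, n, eta, steps, batch_size=1, seed=0):
    # batch_size=1 is plain SGD; larger values average the per-point
    # gradients over a mini-batch, reducing the noise of each step
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        batch = rng.integers(0, n, size=batch_size)        # sample points uniformly
        g = np.mean([grad_Li(w, i) for i in batch], axis=0)
        w = w - eta * g                                    # noisy descent step
    return w

# toy problem: L_i(w) = (w - t_i)^2 / 2, so L(w) is minimized at mean(t)
t = np.array([1.0, 2.0, 3.0, 4.0])
grad_Li = lambda w, i: w - t[i]
w_hat = sgd(np.zeros(1), grad_Li, n=4, eta=0.05, steps=2000, batch_size=2)
# w_hat fluctuates around the true minimizer, 2.5
```

With a fixed step size the iterate never settles exactly at the minimizer; it hovers in a noise ball whose size shrinks with $\eta$ and with the batch size.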
See Sebastian Ruder's blog post for an overview of gradient descent optimization algorithms, the animations by Alec Radford linked there for demos, and the interactive visualization by Roberts Dionne.
In the examples above we used a fixed learning rate for simplicity, but the choice of learning rate is very important and affects the behavior and convergence time of the algorithm.
Early in SGD training we can use a large learning rate, because the noisy gradients still mostly point in the same direction. Later in training we are closer to the optimum, the noise dominates, and we want to slow down to average it out. We can use the validation set to decide when to do this, e.g. decay the learning rate when the validation loss plateaus.
Ravines (where the surface curves steeply in one dimension) are problematic because SGD can oscillate.
$G_t$ is a diagonal matrix with the sum of the squares of the gradient of the $i$th element at each $i,i$ entry.
When using Adagrad, we don't need to tune the learning rate. However, as training goes on the updates become smaller and smaller and the algorithm can stagnate. This is addressed by later algorithms such as Adadelta, RMSprop, and Adam.
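A simplified sketch of the diagonal Adagrad update, keeping only the diagonal of $G_t$ as described above:

```python
import numpy as np

def adagrad(w, grad, eta=1.0, steps=500, eps=1e-8):
    # G holds the diagonal of G_t: the running sum of squared gradients
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        G += g ** 2                           # accumulate per coordinate
        w = w - eta * g / (np.sqrt(G) + eps)  # per-coordinate step sizes
    return w

# toy quadratic whose two coordinates have very different gradient scales
grad = lambda w: np.array([0.1 * w[0], 10.0 * w[1]])
w_hat = adagrad(np.array([5.0, 5.0]), grad)
# both coordinates converge despite the 100x difference in gradient scale
```

Dividing by $\sqrt{G_{ii}}$ makes the effective step size per coordinate roughly scale-free, which is why no per-problem learning-rate tuning is needed.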
Decision boundary:
$$ w_0 + \sum_{i=1}^d w_i x_i^j = 0 $$
What about:
$$ (10w_0) + \sum_{i=1}^d (10w_i)x_i^j = 0 $$
or
$$ (10000w_0) + \sum_{i=1}^d (10000w_i)x_i^j = 0 $$
Same boundary! Which one has higher log likelihood?
$$l({\bf w}) \equiv \ln \prod_j P(y^{j}|{\bf x}^{j},{\bf w})$$
plt.figure(figsize=(8,6))
z = np.linspace(-10,10,1000)
plt.plot(z,1/(1+np.exp(-z)), label = 'a=1')
plt.plot(z,1/(1+np.exp(-10*z)), label = 'a=10')
plt.plot(z,1/(1+np.exp(-100*z)), label = 'a=100')
plt.legend(); plt.xlabel(r'$Z$',fontsize=16);
plt.ylabel(r'logistic$(a Z) = P(Y=1|X)$',fontsize=16);
$w\rightarrow \infty$ if the data is linearly separable
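A quick numeric check on a tiny, hypothetical separable dataset: scaling ${\bf w}$ leaves every point on the same side of the boundary but increases the log-likelihood, which is why MLE drives the weights to infinity:

```python
import numpy as np

def loglik(w, X, y):
    z = X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

# hypothetical separable data: one feature plus an intercept column
X = np.array([[1., -2.], [1., -1.], [1., 1.], [1., 2.]])
y = np.array([0., 0., 1., 1.])
w = np.array([0., 1.])            # boundary at x = 0, classifies perfectly

ll_1  = loglik(w, X, y)           # some negative number
ll_10 = loglik(10 * w, X, y)      # same boundary, log-likelihood closer to 0
```

No finite ${\bf w}$ maximizes the likelihood here: every rescaling by a factor above 1 does a little better.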
For MAP, need to define prior on $W$
A kind of Occam’s razor (simplest is best) prior
Helps avoid very large weights and overfitting
MAP estimation picks the parameter $W$ that has maximum posterior probability $P(W|Y,X)$ given the conditional likelihood $P(Y|W,X)$ and the prior $P(W)$.
Using Bayes rule again:
\begin{eqnarray} W^{MAP} &=& \underset{W}{\operatorname{argmax}} P(W|Y,X) = \underset{W} {\operatorname{argmax}} \frac{P(Y|W,X)P(W,X)}{P(Y,X)} \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W,X) \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W)P(X) ~~~~~ \text{ assume } P(W,X) = P(W)P(X) \\ &=& \underset{W}{\operatorname{argmax}} P(Y|W,X)P(W)\\ &=& \underset{W}{\operatorname{argmax}} \ln P(Y|W,X) + \ln P(W) \end{eqnarray}
Zero-mean Gaussian prior on $W$: $P(W) \propto \exp\big(-\frac{1}{2\sigma^2}\sum_i w_i^2 \big)$
\begin{eqnarray} W^{MAP} = \underset{W}{\operatorname{argmax}} \ln P(Y|W,X) - \frac{1}{2\sigma^2}\sum_i w_i^2 \end{eqnarray}
This "pushes" the parameters towards zero and corresponds to $L_2$ regularization.
lmbda = 10 # this is 1/(2*sigma**2)
def logposterior(w1,w2):
    w = np.array([[w1],[w2]]) # make w a column vector
    loglik = np.sum(Y*X.dot(w) - np.logaddexp(0, X.dot(w)))  # stable log(1+exp)
    loglik += - (w1**2 + w2**2)*lmbda  # log of the Gaussian prior, up to a constant
    return loglik
L_w = np.vectorize(logposterior)(*np.meshgrid(w1x, w2x))
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='Blues_r',vmin=np.min(L_w),
vmax=0,alpha=0.6);
plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20)
plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
from scipy.special import expit
def gradient_posterior(w1,w2,X,Y):
    w = np.array([[w1],[w2]])
    P_Y_1 = expit(X.dot(w))  # numerically stable sigmoid
    gw1 = X[:,0:1].T.dot(Y-P_Y_1) - 2*lmbda*w1
    gw2 = X[:,1:2].T.dot(Y-P_Y_1) - 2*lmbda*w2
    return gw1, gw2
plt.figure(figsize=(8,6))
cs = plt.contourf(W1, W2, L_w,20,cmap='RdBu_r',vmin=-np.max(np.abs(L_w)),
vmax=np.max(np.abs(L_w)),alpha=0.6); plt.colorbar()
contours = plt.contour(cs, colors='k')
plt.xlabel(r'$w_1$',fontsize=20);plt.ylabel(r'$w_2$',fontsize=20)
plt.title('Parameter space',fontsize=20);
w1 = 7; w2 = -16
gw1, gw2 = gradient_likelihood(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'likelihood $\nabla/10$','r');
gw1, gw2 = gradient_posterior(w1,w2,X, Y)
draw_vector_2D(plt, w1,w2,gw1/10,gw2/10, r'posterior $\nabla/10$','b');
Consider these two assumptions:
1. The features $X_i$ are conditionally independent given $Y$ (the Naive Bayes assumption)
2. Each $P(X_i|Y=k)$ is Gaussian with class-independent variance ($\sigma_{i1} = \sigma_{i0}$)
Consider three learning methods:
- GNB: Gaussian Naive Bayes, with class-dependent variances allowed
- GNB2: Gaussian Naive Bayes with variances constrained to be equal across classes
- LR: Logistic regression
How do these methods perform if we have plenty of data and:
Both (1) and (2) are satisfied.
Recall the decision boundary of GNB when the variances are exactly equal:
\begin{eqnarray} \ln \frac{P(Y=1|X_1...X_d)}{P(Y=0|X_1...X_d)} &=& C + G(X)\\ G(X) &=& -\frac{1}{2} \sum_i \big( x_{i} ^2(\frac{1}{\sigma_{i1}^2} - \frac{1}{\sigma_{i0}^2} ) - 2 x_{i} (\frac{\mu_{i1}}{\sigma_{i1}^2} - \frac{\mu_{i0}}{\sigma_{i0}^2} ) + (\frac{\mu_{i1}^2}{\sigma_{i1}^2} - \frac{\mu_{i0}^2}{\sigma_{i0}^2} ) \big ) \\ G(X) &=& \sum_i \big( x_{i} \frac{\mu_{i1}-\mu_{i0}}{\sigma_{i}^2} \big ) - \sum_i \big( \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_{i}^2} \big ) \end{eqnarray}
The decision boundary is linear, of the form: $\beta_0 + \sum_i \beta_i x_i = 0$.
The parameters are determined using the distance between centers, weighted by variance on each dimension.
Consider the same two assumptions and three learning methods (GNB, GNB2, LR) as above.
How do these methods perform if we have plenty of data and:
(2) is satisfied, but not (1).
Why doesn't GNB2 learn the same boundary as LR?
The decision boundary for GNB2 is linear, of the form: $\beta_0 + \sum_i \beta_i x_i = 0$.
But each parameter is linked to the individual means and standard deviations of each dimension $x_i$, e.g.:
$$ \beta_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_{i}^2} $$
GNB2 is therefore less flexible than LR.
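To make the constraint concrete, a sketch that estimates the GNB2-implied linear weights from the class means and a pooled per-dimension variance (simulated, balanced classes, so the prior term $C$ vanishes):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([-4, 4], 2.0, size=(200, 2))     # samples with Y=1
X0 = rng.normal([4, -4], 2.0, size=(200, 2))     # samples with Y=0

mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
# pooled per-dimension variance: GNB2's shared sigma_i^2
var = np.concatenate([X1 - mu1, X0 - mu0]).var(axis=0)

beta = (mu1 - mu0) / var                          # beta_i = (mu_i1 - mu_i0) / sigma_i^2
beta0 = -np.sum((mu1**2 - mu0**2) / (2 * var))    # intercept from G(X), with C = 0

# the implied linear rule separates the two classes well
acc = (np.mean(beta0 + X1 @ beta > 0) + np.mean(beta0 + X0 @ beta < 0)) / 2
```

Unlike LR, which fits $\beta$ freely, every GNB2 weight is pinned to these per-dimension statistics.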
Consider the same two assumptions and three learning methods (GNB, GNB2, LR) as above.
How do these methods perform if we have plenty of data and:
Neither (1) nor (2) is satisfied.
Which method works better if we have infinite training data, and...
The bottom line:
GNB2 and LR both use linear decision surfaces; GNB need not.
Given infinite data, LR is better than or equal to GNB2, because its training procedure does not make assumptions (1) or (2) (though our derivation of the form of $P(Y|X)$ did).
GNB2 converges more quickly to its perhaps-less-accurate asymptotic error (it has more bias than LR).
GNB is both more biased than LR (assumption 1) and less biased (no linearity assumption), so either might outperform the other.
LR is a linear classifier: decision rule is a hyperplane